This reports analyzes the information presented in the data set
Sales Videogames and its main objective is to show the
process of data cleaning and present the main insights and
recommendations for the industry to thrive in the years to come.
Other than fun for children, youth and adults, video games have become one of the most profitable and commercially attractive industries of this decade. In 2017 alone, the global video game market exceeded $ 100 billion, a figure that marked a milestone in the history of this segment, registering a growth of more than 50% in five years. The market is led by video games for mobile devices (42%), followed by video games for consoles (31%) and for computers (27%), which include physical, downloadable and browser video games.
Currently, video game companies around the world, in addition to attracting and entertaining the consumer, direct their strategies towards the development of comprehensive entertainment systems so that, in collaboration with other industries, they can monetize the interest of the traditional public and even the new consumer segments. In other words, the profits are generated not only by the purchase of the console and the video game, but also include mechanisms such as subscriptions fees of entertainment networks; selling tickets for special events, conventions or tournaments; complementary physical accessories; and integrated shopping and advertising, to name just a few.
Recently, a student from Tecnologico de Monterrey was hired to clean, transform, merge, process, and analyze a dataset which allows them to create a platform to visualize which video games have been the best-selling in the last few years as well as other interesting exploratory and descriptive analysis. This tool will be offered to developer companies in order to identify user preferences and thus, be able to make decisions that open the opportunities for them building greater profitability for their business.
To develop this dataset, the student collected information that includes a video game dataset from 1977 to the end of the second decade of the 21st century. The variables included are: name of the videogame, the genre of the video game, the company that published it (publisher), the console where it is played, as well as the sales generated in different regions of the world by the video game on its launch date (in millions of units), a measure of consumer rating to the videogame, and whether the game won an award.
However, the content of the dataset is not cleaned and the information contains errors that could cause inappropriate decision-making for the end user of the platform. Therefore, the analyst must first clean, transform, merge, and pre-process the dataset in order to be able to analyze it then.
To do the pertinent cleaning of the dataset we used RStudio, using a Markdown file set to HTML. The analysis and ploting was also done in the same R environment.
This project is divided into 2 big objectives.
We used the dataset SalesVideogames to perform all the
analysis. It contains 22 columns that can be summarized in information
about the videogame and sales by region. The variables are:
As mentioned before, the dataset that was used for this job was untidy and had various errors that made it inappropriate for its use. So I could be able to get real insight, and as one of the objectives, a data cleaning process was done.
The first step done was to join all the categories columns (from
Action to Strategy) into 1 that showed them
all and created a column with the value fo the years. This action was
performed with the function gather(). In the same me step,
with the use of %>% (pipes), I replaced all the
Platform.1 for Platform. This is because as it
was previously a column and so there were no repeated names this has a
name change, but as it becomes a value of a variable we can perform the
change. Here I was also able to remove all the missing values from the
variable year, leaving only the actual videogames released. Once created
the new columns, I made sure that category was a factor and
year an integer with the functions as.factor
and as.integer respectively.
| X | Rank | Name | Platform | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Action | Adventure | Fighting | Misc | Platform.1 | Puzzle | Racing | RolePlaying | Shooter | Simulation | Sports | Strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Wii Sports | Wii | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2006 | NA |
| 2 | 2 | Super Mario Bros. | NES | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NA | NA | NA | NA | 1985 | NA | NA | NA | NA | NA | NA | NA |
| 3 | 3 | Mario Kart Wii | Wii | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 | NA | NA | NA | NA | NA | NA | 2008 | NA | NA | NA | NA | NA |
| 4 | 4 | Wii Sports Resort | Wii | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2009 | NA |
| 5 | 5 | Pokemon Red/Pokemon Blue | GB | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | NA | NA | NA | NA | NA | NA | NA | 1996 | NA | NA | NA | NA |
| 6 | 6 | Tetris | GB | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 | NA | NA | NA | NA | NA | 1989 | NA | NA | NA | NA | NA | NA |
| X | Rank | Name | Platform | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | category | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 17 | Grand Theft Auto V | PS3 | Take-Two Interactive | 7.01 | 9.27 | 0.97 | 4.14 | 21.40 | Action | 2103 |
| 18 | 18 | Grand Theft Auto: San Andreas | PS2 | Take-Two Interactive | 9.43 | 0.40 | 0.41 | 10.57 | 20.81 | Action | 2004 |
| 24 | 24 | Grand Theft Auto V | X360 | Take-Two Interactive | 9.63 | 5.31 | 0.06 | 1.38 | 16.38 | Action | 2013 |
| 25 | 25 | Grand Theft Auto: Vice City | PS2 | Take-Two Interactive | 8.41 | 5.49 | 0.47 | 1.78 | 16.15 | Action | 2002 |
| 39 | 39 | Grand Theft Auto III | PS2 | Take-Two Interactive | 6.99 | 4.51 | 0.30 | 1.30 | 13.10 | Action | 2001 |
| 45 | 45 | Grand Theft Auto V Online | PS4 | Take-Two Interactive | 3.80 | 5.81 | 0.36 | 2.02 | 11.98 | Action | 2014 |
Following the data cleaning, I had to check the remaining numeric
variables, that where all the sales. The first part here wass to find
the missing values using the function is.na with all 5
variables. In this particular case, only the regions had 4 missing
values each. to replace them, I used the mean for each one and change
all of them with an ifelse() statement.
The variable yearhad some outliers cause by a fail in
the capturing. To find them first I made a point graph that clearly
showed a pattern and those points that were not even close. Also a table
showing the year and the name was created to identify the specific
outliers and to be able to investigate the actual year that videogame
was released. With these identified, using the
str_replace_all() function, I was able to replace all those
outliers. Finally, the same plot was made, but now it had no more
outliers
Plot 1.1 Year Plot with Outliers
Plot 1.1 Year Plot with Outliers
Next, I decided to use the functions str_trim() and
str_to_lower() to solve som minor inconsistencies on our
categorical variables. In this part I also replaced all xbox to xb in
the platform column because they were refering to the same console.
Finally, I created 2 new datasets using gather() to
group all regions into one variable. Both datasets are similar and have
only one difference, the first one gathers only the regions but the
second one also considered Global_Sales. This step was done
to be more useful for the analitics process to perform.
| X | Rank | Name | Platform | Publisher | Global_Sales | category | year | Region | Sales |
|---|---|---|---|---|---|---|---|---|---|
| 17 | 17 | grand theft auto v | ps3 | take-two interactive | 21.40 | Action | 2013 | NA_Sales | 7.01 |
| 18 | 18 | grand theft auto: san andreas | ps2 | take-two interactive | 20.81 | Action | 2004 | NA_Sales | 9.43 |
| 24 | 24 | grand theft auto v | x360 | take-two interactive | 16.38 | Action | 2013 | NA_Sales | 9.63 |
| 25 | 25 | grand theft auto: vice city | ps2 | take-two interactive | 16.15 | Action | 2002 | NA_Sales | 8.41 |
| 39 | 39 | grand theft auto iii | ps2 | take-two interactive | 13.10 | Action | 2001 | NA_Sales | 6.99 |
| 45 | 45 | grand theft auto v online | ps4 | take-two interactive | 11.98 | Action | 2014 | NA_Sales | 3.80 |
| X | Rank | Name | Platform | Publisher | category | year | Region | Sales |
|---|---|---|---|---|---|---|---|---|
| 17 | 17 | grand theft auto v | ps3 | take-two interactive | Action | 2013 | NA_Sales | 7.01 |
| 18 | 18 | grand theft auto: san andreas | ps2 | take-two interactive | Action | 2004 | NA_Sales | 9.43 |
| 24 | 24 | grand theft auto v | x360 | take-two interactive | Action | 2013 | NA_Sales | 9.63 |
| 25 | 25 | grand theft auto: vice city | ps2 | take-two interactive | Action | 2002 | NA_Sales | 8.41 |
| 39 | 39 | grand theft auto iii | ps2 | take-two interactive | Action | 2001 | NA_Sales | 6.99 |
| 45 | 45 | grand theft auto v online | ps4 | take-two interactive | Action | 2014 | NA_Sales | 3.80 |
As part of our second objective, an analysis must be done to be able to answer the initial questions.
All the analysis was performed on R, being able to get some statistical values such as the mean tosolve somo of the problems we had. To explain in a easier way the data, some tables and plots were used.
In this section you will find the solving of the quiestioning proposed ans some insights found through them.
| Region | avg_sales |
|---|---|
| Global_Sales | 0.5374407 |
| NA_Sales | 0.2643281 |
| EU_Sales | 0.1463619 |
| JP_Sales | 0.0777769 |
| Other_Sales | 0.0480270 |
The first behavior we are able to observe here is that the average sales of videogames was at is peak before the 90’s. This is telling us that the most succesful decade, taking into account only the average of the sales, was the 80’s. Here we can also see that for most of the years, North America has been the most succesful region, being surpased clearly only by Japan. Basically, until 1996 they were the top contenders on this aspect, but as the average has decreased to our days, the top regions are North america and Europe. Now, talking about the all time mean sales, both North America and Europe, represent close to a 77% of the Global sales.
| Year | Global |
|---|---|
| 2008 | 690.56 |
| 2009 | 670.21 |
| 2007 | 621.08 |
| 2010 | 606.30 |
| 2011 | 528.68 |
With this graph and table we can see in a very clear way, how
videogames sales behave through the years. According to the data, the
year when the global sales where at its highest was 2008 but 2009 got
really close. The greater sales were done from 2005 to 2011.
In
general we can se that from the first years until 2008, the total Global
sales follows a growing pattern that get to its peak and starts
plummeting until 2017.
Another interesting finding in here was that
the amount of sales follows a peculiar growth pattern. There is a curve
formed by between 3 to 6 years in which there are ups and downs. It
grows for a few tears, reach a peak and goes down again. This cycle
repeats through out the hole graph, but we never get to see the behavior
after the last decrease in sales in 2017.
The first insight found is that before the 90´s, the best selling
videogames were fron the categories
Platform and
Puzzle. For the next decades, we can observe that in
general the sales of all the regions have a similar behavior, with the
exeption of Japan. For all The top categories from 1990 to 2017 were
Action, Sports, Misc and
Shooters in that order, but for japan it is more like
Role Playing, Misc, and Action.
Now, when we compare that to the global sales by category, we can see
that the global pattern is the same as the one from the 3 cattegories
that are not Japan.
| Region | category | SalesxRegion |
|---|---|---|
| NA_Sales | Action | 877.8300 |
| EU_Sales | Action | 525.0000 |
| JP_Sales | RolePlaying | 352.3100 |
| Other_Sales | Action | 186.9261 |
Visually, the first thing that shows up is that the region of North
America represents close to the half of the total sales. This graph
confirms the asumpyions made by analyzing the most successful categories
by year. For North America, Europe and Other regions the top selling
genre was Action and the runner-up was Sports. In the case of Japan Role
Playing got the first place followed by Action.
A very interesting
detail is that for North America, Europe and Other shooters videogames
were in the Top 3, but for Japan, Shooters are the least succesfull
videogames category.
The console that has released more videogames is the Nintendo DS with a total of 2163
The PlayStation2 released 2161 videogames being the second place only 2 releases behind
The category that releases the most videogames is Action, followed by
Sports and in 3rd place is Misc.
Top 1
Top 2
Top 3
Top 4
Top 5
| Region | Publisher | SalesxRegion |
|---|---|---|
| Global_Sales | nintendo | 1786.9100 |
| NA_Sales | nintendo | 816.8700 |
| JP_Sales | nintendo | 455.6878 |
| EU_Sales | nintendo | 418.7400 |
| Other_Sales | electronic arts | 129.7700 |
We can appreciate that for Global sales the most successful publisher
is Nintendo, and it remains like this for all the regions
except from others were EA wins. This surprises because
Nintendo is not the company that released the more videogames, but is
the number 7.
The main insight that we get from this visualization is that succes may
not only be measured By the amount of videogames published or the total
sales. There are some companies that released even only 1 game, but when
getting the relation it holds against the sales it got, they got to the
top. The surprise here is that Electronic Arts, which is the company
that releases the most videogames, is not even in the top 10. In the
case of nintendo that have the most sales, it apears in the top bot it
is not the best one. In general the best is palcom.
The data studied gave us some really important insights.
library(dplyr) #for data manipulation
library(ggplot2) #for graphs and plots
library(treemapify) # a complement to ggplot
library(stringr) #string characters
library(tidyr) #manipulate datasets
library(visdat) #for visualizing missing data
library(assertive) #assert functions
library(forcats) #Manipulate factorssales <- read.csv("SalesVideogames.csv")
glimpse(sales)sales <- sales %>%
gather(key="category", value="year", 11:22) %>%
mutate(category = str_replace_all(category, "Platform.1", "Platform")) %>%
filter(!is.na(year))
sales$category <- as.factor(sales$category)
sales$year <- as.integer(sales$year)
head(sales)# Looking for NA
sum(is.na(sales$NA_Sales))
sum(is.na(sales$EU_Sales))
sum(is.na(sales$JP_Sales))
sum(is.na(sales$Other_Sales))
sum(is.na(sales$Global_Sales))
# Summary of the dataset
summary(sales$NA_Sales)
summary(sales$EU_Sales)
summary(sales$JP_Sales)
summary(sales$Other_Sales)
# Replacing NA for Mean
sales <- sales %>%
mutate(NA_Sales = ifelse(is.na(NA_Sales),0.2643,NA_Sales)) %>%
mutate(EU_Sales = ifelse(is.na(EU_Sales),0.1464,EU_Sales)) %>%
mutate(JP_Sales = ifelse(is.na(JP_Sales),0.07778,JP_Sales)) %>%
mutate(Other_Sales = ifelse(is.na(Other_Sales),0.04803,Other_Sales))
# Confirmation
sum(is.na(sales$NA_Sales))
sum(is.na(sales$EU_Sales))
sum(is.na(sales$JP_Sales))
sum(is.na(sales$Other_Sales))# Visualizing Outliers
sales %>%
ggplot(aes(x=X, y=year)) +
geom_point()
# Changing values
sales$year <- as.character(sales$year)
sales <- sales %>%
mutate(year = str_replace_all(year,"1797","1997")) %>%
mutate(year = str_replace_all(year,"1894","1984")) %>%
mutate(year = str_replace_all(year,"2103","2013")) %>%
mutate(year = str_replace_all(year,"2102","2012")) %>%
mutate(year = str_replace_all(year,"2130","2013")) %>%
mutate(year = str_replace_all(year,"2101","2011")) %>%
mutate(year = str_replace_all(year,"2200","2009")) %>%
mutate(year = str_replace_all(year,"2105","2015"))
sales$year <- as.integer(sales$year)
# Checking
sales %>%
ggplot(aes(x=X, y=year)) +
geom_point()sales <- sales %>%
mutate(Name = str_trim(Name)) %>%
mutate(Name = str_to_lower(Name)) %>%
mutate(Platform = str_trim(Platform)) %>%
mutate(Platform = str_to_lower(Platform)) %>%
mutate(Publisher = str_trim(Publisher)) %>%
mutate(Publisher = str_to_lower(Publisher)) %>%
mutate(Platform = str_replace_all(Platform, "xbox", "xb"))sales2 <- sales %>%
gather(key="Region", value="Sales", 6:9)
sales2
sales3 <- sales %>%
gather(key="Region", value="Sales", 6:10)
sales3graph1 <- sales3 %>%
group_by(year,Region) %>%
summarise(avg_sales=mean(Sales)) %>%
arrange(-avg_sales)
graph1
graph1_1 <- sales3 %>%
group_by(Region) %>%
summarise(avg_sales=mean(Sales)) %>%
arrange(-avg_sales)
graph1_1
ggplot(graph1, aes(x=year, y=avg_sales, color=Region)) +
geom_line()+
labs(x="Year", y="Average Sales", title="Average sales by year by region")+
theme_minimal()graph2 <- sales %>%
select(year, NA_Sales,EU_Sales, JP_Sales, Other_Sales, Global_Sales) %>%
group_by(year) %>%
summarise(Global=sum(Global_Sales)) %>%
arrange(-Global)
graph2
ggplot(graph2, aes(x=year, y=Global))+
geom_line()+
geom_point()+
ylim(c(0,750))+
labs(x="Year", y="Global Sales", title="Global Sales by year")+
theme_light()graph3 <- sales2 %>%
group_by(year,Region,category) %>%
summarise(SalesxRegion = sum(Sales))
graph3
graph3_1 <- sales2 %>%
group_by(Region,category) %>%
summarise(SalesxRegion = sum(Sales))
graph3_1
graph3_2 <- sales2 %>%
group_by(Region,category) %>%
summarise(SalesxRegion = sum(Sales))%>%
top_n(1)
graph3_2
ggplot(graph3, aes(x=category, y=SalesxRegion, fill=category))+
geom_bar(stat = "identity")+
facet_wrap("Region", scales = "free_y")+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
labs(x="", y="Sales", title="Sales by Genre")
ggplot(graph3, aes(x=year, y=SalesxRegion, color=category))+
geom_line()+
facet_wrap("Region", scales = "free_y")+
theme(panel.background = element_rect(fill="white", color = "white"),
panel.grid.major = element_line(size = 0.25, linetype = "solid",color = "lightgrey"),
panel.grid.minor = element_line(size = 0.12,linetype = "solid", color="lightgrey"),
plot.background = element_rect(fill = "lightgreen"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank())
ggplot(graph3_1,aes(area=SalesxRegion, fill=Region, label=category, subgroup=Region))+
geom_treemap()+
geom_treemap_text(colour = "black",
place = "centre",
size = 15)+
ggtitle("Most successful genre of all time for each region")sales %>%
count(Platform) %>%
arrange(-n)
ggplot(sales, aes(x=Platform)) +
geom_bar(fill="lightblue") +
labs(x="Console", y="#Videogames Released", title="Videogames released by platform")+
theme_bw()ggplot(sales, aes(x=category, fill=category)) +
geom_bar() +
labs(x="Category", y="#Videogames Released", title="Videogames released by platform")+
theme_classic()+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())graph4 <- sales %>%
group_by(Publisher) %>%
count() %>%
filter(n>400) %>%
arrange(-n)
graph4
ggplot(graph4, aes(x=Publisher, y=n, fill=Publisher))+
geom_bar(stat = "identity")+
labs(x="Publisher", y="#Videogames Released", title="Videogames released by publisher")+
theme_linedraw()+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())graph5 <- sales3 %>%
group_by(Region,Publisher) %>%
summarise(SalesxRegion=sum(Sales)) %>%
top_n(1) %>%
arrange(-SalesxRegion)
graph5
graph5_1 <- sales3 %>%
group_by(Region,Publisher) %>%
summarise(SalesxRegion=sum(Sales))
graph5_1
ggplot(graph5_1,aes(area=SalesxRegion, fill=Region, label=Publisher, subgroup=Region))+
geom_treemap()+
geom_treemap_text(colour = "black",
place = "centre",
size = 15)+
scale_fill_brewer(palette = "Greens")graph6 <- sales3 %>%
group_by(Region,Publisher) %>%
summarise(VGReleased = n(), SalesxRegion = sum(Sales)) %>%
mutate(SalesbyReleases = SalesxRegion/VGReleased) %>%
arrange(-SalesbyReleases)
graph6_1 <- sales3 %>%
group_by(Region,Publisher) %>%
summarise(VGReleased = n(), SalesxRegion = sum(Sales)) %>%
mutate(SalesbyReleases = SalesxRegion/VGReleased) %>%
top_n(1) %>%
arrange(-SalesbyReleases)
graph6_1
graph6_2 <- sales3 %>%
group_by(Region,Publisher) %>%
summarise(VGReleased = n(), SalesxRegion = sum(Sales)) %>%
mutate(SalesbyReleases = SalesxRegion/VGReleased) %>%
top_n(10) %>%
arrange(-SalesbyReleases)
graph6_2
ggplot(graph6,aes(area=SalesbyReleases, fill=Region, label=Publisher, subgroup=Region))+
geom_treemap()+
geom_treemap_text(colour = "black",
place = "centre",
size = 15)+
ggtitle("Most successful genre of all time for each region")+
theme(plot.background = element_rect(fill = "#222222",color = "#222222"),
legend.background = element_rect(fill = "#222222",color = "#222222"),
plot.title = element_text(color="white"),
legend.title = element_text(color="white"),
legend.text = element_text(color="white"))
ggplot(graph6_2,aes(area=SalesbyReleases, fill=Region, label=Publisher, subgroup=Region))+
geom_treemap(show.legend = FALSE)+
geom_treemap_text(colour = "black",
place = "centre",
size = 15)+
ggtitle("Most successful genre of all time for each region")+
theme(plot.background = element_rect(fill = "#222222",color = "#222222"),
legend.background = element_rect(fill = "#222222",color = "#222222"),
plot.title = element_text(color="white"))